{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "collapsed": true
      },
      "source": [
        "## Microsoft Concept Graph\n",
        "\n",
        "[Microsoft Concept Graph](https://concept.research.microsoft.com/) is a large taxonomy of terms mined from the internet, with `is-a` relations between concepts. \n",
        "\n",
        "Context Graph is available in two forms:\n",
        " * Large text file for download\n",
        " * REST API\n",
        "\n",
        "Statistics:\n",
        " * 5401933 unique concepts, \n",
        " * 12551613 unique instances\n",
        " * 87603947 `is-a` relations"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Using Web Service\n",
        "\n",
        "Web service offers different calls to estimate probability of a concept belonging to different groups. More info is available [here](https://concept.research.microsoft.com/Home/Api).\n",
        "Here is the sample URL to call: `https://concept.research.microsoft.com/api/Concept/ScoreByProb?instance=microsoft&topK=10`"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 7,
      "metadata": {
        "trusted": true
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "{'company': 0.6105356614382954,\n",
              " 'vendor': 0.08858636677518003,\n",
              " 'client': 0.048239124001183784,\n",
              " 'firm': 0.045476965571668145,\n",
              " 'large company': 0.043109401203511886,\n",
              " 'organization': 0.043010752688172046,\n",
              " 'corporation': 0.035908059583703265,\n",
              " 'brand': 0.03383644076156654,\n",
              " 'software company': 0.027522935779816515,\n",
              " 'technology company': 0.023774292196902438}"
            ]
          },
          "execution_count": 7,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "import urllib\n",
        "import json\n",
        "import ssl\n",
        "\n",
        "def http(x):\n",
        "    ssl._create_default_https_context = ssl._create_unverified_context\n",
        "    response = urllib.request.urlopen(x)\n",
        "    data = response.read()\n",
        "    return data.decode('utf-8')\n",
        "\n",
        "def query(x):\n",
        "    return json.loads(http(\"https://concept.research.microsoft.com/api/Concept/ScoreByProb?instance={}&topK=10\".format(urllib.parse.quote(x))))\n",
        "\n",
        "query('microsoft')"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Let's try to categorize the news titles using parent concepts. To get news titles, we will use [NewsApi.org](http://newsapi.org) service. You need to obtain your own API key in order to use the service - go to the web site and register for free developer plan."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 20,
      "metadata": {
        "trusted": true
      },
      "outputs": [],
      "source": [
        "newsapi_key = '<your API key here>'\n",
        "def get_news(country='us'):\n",
        "    res = json.loads(http(\"https://newsapi.org/v2/top-headlines?country={0}&apiKey={1}\".format(country,newsapi_key)))\n",
        "    return res['articles']\n",
        "\n",
        "all_titles = [x['title'] for x in get_news('us')+get_news('gb')]"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 21,
      "metadata": {},
      "outputs": [
        {
          "data": {
            "text/plain": [
              "['Covid-19 Live Updates: Vaccines and Boosters News - The New York Times',\n",
              " 'Ukrainians Flee Mariupol as Russian Forces Push to Take Port City - The Wall Street Journal',\n",
              " 'Bond Yields Jump, Stock Futures Rise After Powell Says Fed Is Ready to Be More Aggressive - The Wall Street Journal',\n",
              " 'Putin critic Alexei Navalny found guilty by Russian court - New York Post ',\n",
              " \"Supreme Court nominee Ketanji Brown Jackson will face questions at confirmation hearing's second day - CNN\",\n",
              " '2 teachers killed at Swedish high school, student arrested - ABC News',\n",
              " 'Clues to Covid-19’s Next Moves Come From Sewers - The Wall Street Journal',\n",
              " 'Republicans to roll dice by grilling Jackson over child-pornography sentencing decisions | TheHill - The Hill',\n",
              " '‘Clear sign’ Putin considering using chemical weapons in Ukraine, claims President Biden - The Independent',\n",
              " 'NASA confirms there are 5,000 planets outside our solar system - Daily Mail',\n",
              " \"US stocks whipsawed overnight after Fed Chair Powell's remarks - Fox Business\",\n",
              " \"'We've learned absolutely nothing': Tests could again be in short supply if Covid surges - POLITICO\",\n",
              " \"Duchess of Cambridge swaps khaki jungle gear for Vampire's Wife dress on Belize trip - Daily Mail\",\n",
              " 'China searches for victims, flight recorders after first plane crash in 12 years - Reuters',\n",
              " 'Second superyacht linked to Russian oligarch Abramovich docks in Turkey - Reuters',\n",
              " 'Live updates: Russia stops talks with Japan over sanctions - The Associated Press - en Español',\n",
              " 'Powers Remain and Threats Lurk as Women’s Sweet 16 Is Set - The New York Times',\n",
              " 'Webb Space Telescope Begins Multi-Instrument Alignment - SciTechDaily',\n",
              " \"UConn vs UCF - NCAA women's tournament second-round highlights - March Madness\",\n",
              " 'Bucking Republican Trend, Indiana Governor Vetoes Transgender Sports Bill - The New York Times',\n",
              " \"Maggie Fox dead: Coronation Street and Shameless actress dies after 'sudden accident' - Mirror Online - The Mirror\",\n",
              " 'China plane crash – live: Search for survivors continues as witness describes moment flight fell from sky - The Independent',\n",
              " 'Daniel Morgan murder: damning report condemns Met police - The Guardian',\n",
              " 'What to expect from Rishi Sunak’s Spring Statement - BBC.com',\n",
              " 'UK and Republic of Ireland in line to host Euro 2028 after no one else bids - The Guardian',\n",
              " \"Friends beg Vladimir Putin's 'lover' to persuade him to end Ukraine invasion - The Mirror\",\n",
              " 'Brass Eye’s outtakes show the brutal TV comedy was the tip of an iceberg - The Guardian',\n",
              " \"Vladimir Putin threatens civilians to break Mariupol's spirit - The Times\",\n",
              " 'Shell U-turn on Cambo oilfield would threaten green targets, say campaigners - The Guardian',\n",
              " 'St Helens dog attack: Girl aged 17 months killed at home - BBC',\n",
              " \"PlayStation to buy 'Assassin's Creed' veteran Jade Raymond's Haven Studios - NME\",\n",
              " '‘Clear sign’ Putin considering using chemical weapons in Ukraine, claims President Biden - The Independent',\n",
              " 'NASA confirms there are 5,000 planets outside our solar system - Daily Mail',\n",
              " 'Nintendo Switch finally has folders • Eurogamer.net - Eurogamer.net',\n",
              " 'FA to “find a solution” as Liverpool fan group blasts “shambolic” Wembley travel - This Is Anfield',\n",
              " 'Manchester United transfer news LIVE Erik ten Hag latest and Man Utd manager updates - Manchester Evening News',\n",
              " 'Inflation raises cost of UK government borrowing in February; crude oil up again – business live - The Guardian',\n",
              " 'Alexei Navalny: Kremlin critic found guilty of large-scale fraud and contempt of court by Russian court - Sky News',\n",
              " \"UK prepares to nationalize Russia natural gas giant Gazprom's retail unit - Business Insider\",\n",
              " 'Zaghari-Ratcliffe: Hunt calls for inquiry into delay over Iran debt payment - The Guardian']"
            ]
          },
          "execution_count": 21,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "all_titles"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "First of all, we want to be able to extract nouns from news titles. We will use `TextBlob` library to do this, which simplifies a lot of typical NLP tasks like this."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 15,
      "metadata": {
        "trusted": true
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Requirement already satisfied: textblob in c:\\winapp\\miniconda3\\lib\\site-packages (0.17.1)\n",
            "Requirement already satisfied: nltk>=3.1 in c:\\winapp\\miniconda3\\lib\\site-packages (from textblob) (3.5)\n",
            "Requirement already satisfied: joblib in c:\\winapp\\miniconda3\\lib\\site-packages (from nltk>=3.1->textblob) (1.0.1)\n",
            "Requirement already satisfied: regex in c:\\winapp\\miniconda3\\lib\\site-packages (from nltk>=3.1->textblob) (2021.11.10)\n",
            "Requirement already satisfied: tqdm in c:\\winapp\\miniconda3\\lib\\site-packages (from nltk>=3.1->textblob) (4.61.2)\n",
            "Requirement already satisfied: click in c:\\winapp\\miniconda3\\lib\\site-packages (from nltk>=3.1->textblob) (8.0.3)\n",
            "Requirement already satisfied: colorama in c:\\winapp\\miniconda3\\lib\\site-packages (from click->nltk>=3.1->textblob) (0.4.4)\n",
            "Finished.\n"
          ]
        },
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "[nltk_data] Downloading package brown to\n",
            "[nltk_data]     C:\\Users\\dmitryso\\AppData\\Roaming\\nltk_data...\n",
            "[nltk_data]   Package brown is already up-to-date!\n",
            "[nltk_data] Downloading package punkt to\n",
            "[nltk_data]     C:\\Users\\dmitryso\\AppData\\Roaming\\nltk_data...\n",
            "[nltk_data]   Package punkt is already up-to-date!\n",
            "[nltk_data] Downloading package wordnet to\n",
            "[nltk_data]     C:\\Users\\dmitryso\\AppData\\Roaming\\nltk_data...\n",
            "[nltk_data]   Package wordnet is already up-to-date!\n",
            "[nltk_data] Downloading package averaged_perceptron_tagger to\n",
            "[nltk_data]     C:\\Users\\dmitryso\\AppData\\Roaming\\nltk_data...\n",
            "[nltk_data]   Package averaged_perceptron_tagger is already up-to-\n",
            "[nltk_data]       date!\n",
            "[nltk_data] Downloading package conll2000 to\n",
            "[nltk_data]     C:\\Users\\dmitryso\\AppData\\Roaming\\nltk_data...\n",
            "[nltk_data]   Package conll2000 is already up-to-date!\n",
            "[nltk_data] Downloading package movie_reviews to\n",
            "[nltk_data]     C:\\Users\\dmitryso\\AppData\\Roaming\\nltk_data...\n",
            "[nltk_data]   Package movie_reviews is already up-to-date!\n"
          ]
        }
      ],
      "source": [
        "import sys\n",
        "!{sys.executable} -m pip install textblob\n",
        "!{sys.executable} -m textblob.download_corpora\n",
        "from textblob import TextBlob"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 22,
      "metadata": {
        "trusted": true
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "{'covid-19 live updates': 1,\n",
              " 'vaccines': 1,\n",
              " 'boosters': 1,\n",
              " 'york': 4,\n",
              " 'ukrainians flee mariupol': 1,\n",
              " 'forces push': 1,\n",
              " 'port city': 1,\n",
              " 'wall street journal': 3,\n",
              " 'bond yields': 1,\n",
              " 'futures rise': 1,\n",
              " 'powell says fed': 1,\n",
              " 'ready': 1,\n",
              " 'be': 1,\n",
              " 'aggressive': 1,\n",
              " 'putin': 3,\n",
              " 'alexei navalny': 2,\n",
              " 'russian': 2,\n",
              " 'supreme court nominee': 1,\n",
              " 'ketanji brown jackson': 1,\n",
              " \"confirmation hearing 's\": 1,\n",
              " 'cnn': 1,\n",
              " 'swedish': 1,\n",
              " 'high school': 1,\n",
              " 'abc': 1,\n",
              " 'clues': 1,\n",
              " 'covid-19': 1,\n",
              " '’ s': 2,\n",
              " 'moves': 1,\n",
              " 'sewers': 1,\n",
              " 'roll dice': 1,\n",
              " 'jackson': 1,\n",
              " 'decisions |': 1,\n",
              " 'thehill': 1,\n",
              " 'clear': 2,\n",
              " 'chemical weapons': 2,\n",
              " 'ukraine': 3,\n",
              " 'claims president': 2,\n",
              " 'biden': 2,\n",
              " 'nasa': 2,\n",
              " 'solar system': 2,\n",
              " 'daily mail': 3,\n",
              " 'us stocks': 1,\n",
              " 'fed chair powell': 1,\n",
              " \"'s remarks\": 1,\n",
              " 'fox': 1,\n",
              " \"'we 've\": 1,\n",
              " 'tests': 1,\n",
              " 'covid': 1,\n",
              " 'politico': 1,\n",
              " 'duchess': 1,\n",
              " 'cambridge': 1,\n",
              " 'swaps khaki jungle gear': 1,\n",
              " 'vampire': 1,\n",
              " 'wife': 1,\n",
              " 'belize': 1,\n",
              " 'china': 2,\n",
              " 'flight recorders': 1,\n",
              " 'plane crash': 1,\n",
              " 'reuters': 2,\n",
              " 'russian oligarch': 1,\n",
              " 'abramovich': 1,\n",
              " 'live': 1,\n",
              " 'russia': 2,\n",
              " 'stops talks': 1,\n",
              " 'japan': 1,\n",
              " 'español': 1,\n",
              " 'powers remain': 1,\n",
              " 'threats lurk': 1,\n",
              " 'set': 1,\n",
              " 'webb': 1,\n",
              " 'telescope begins multi-instrument alignment': 1,\n",
              " 'scitechdaily': 1,\n",
              " 'uconn': 1,\n",
              " 'ucf': 1,\n",
              " 'ncaa': 1,\n",
              " \"women 's tournament second-round highlights\": 1,\n",
              " 'march madness': 1,\n",
              " 'bucking republican trend': 1,\n",
              " 'indiana': 1,\n",
              " 'vetoes transgender': 1,\n",
              " 'bill': 1,\n",
              " 'maggie fox': 1,\n",
              " 'coronation': 1,\n",
              " 'shameless': 1,\n",
              " \"'sudden accident\": 1,\n",
              " 'mirror online': 1,\n",
              " 'mirror': 2,\n",
              " 'plane crash –': 1,\n",
              " 'search': 1,\n",
              " 'moment flight': 1,\n",
              " 'daniel morgan': 1,\n",
              " 'report condemns': 1,\n",
              " 'met': 1,\n",
              " 'guardian': 6,\n",
              " 'rishi sunak': 1,\n",
              " '’ s spring': 1,\n",
              " 'statement': 1,\n",
              " 'bbc.com': 1,\n",
              " 'uk': 3,\n",
              " 'ireland': 1,\n",
              " 'euro': 1,\n",
              " 'vladimir putin': 2,\n",
              " \"'s 'lover\": 1,\n",
              " 'brass eye': 1,\n",
              " '’ s outtakes': 1,\n",
              " 'brutal tv comedy': 1,\n",
              " 'threatens civilians': 1,\n",
              " 'mariupol': 1,\n",
              " \"'s spirit\": 1,\n",
              " 'shell u-turn': 1,\n",
              " 'cambo': 1,\n",
              " 'green targets': 1,\n",
              " 'st helens': 1,\n",
              " 'dog attack': 1,\n",
              " 'girl': 1,\n",
              " 'bbc': 1,\n",
              " 'playstation': 1,\n",
              " \"'assassin 's\": 1,\n",
              " 'creed': 1,\n",
              " 'jade raymond': 1,\n",
              " 'haven studios': 1,\n",
              " 'nme': 1,\n",
              " 'nintendo switch': 1,\n",
              " 'folders •': 1,\n",
              " 'eurogamer.net': 2,\n",
              " 'fa': 1,\n",
              " 'solution ”': 1,\n",
              " 'liverpool': 1,\n",
              " 'fan group blasts “ shambolic ”': 1,\n",
              " 'wembley': 1,\n",
              " 'anfield': 1,\n",
              " 'manchester': 1,\n",
              " 'live erik': 1,\n",
              " 'hag': 1,\n",
              " 'utd': 1,\n",
              " 'manager updates': 1,\n",
              " 'manchester evening': 1,\n",
              " 'inflation': 1,\n",
              " 'government borrowing': 1,\n",
              " 'february': 1,\n",
              " 'crude oil': 1,\n",
              " '– business': 1,\n",
              " 'kremlin': 1,\n",
              " 'large-scale fraud': 1,\n",
              " 'sky': 1,\n",
              " 'natural gas': 1,\n",
              " 'gazprom': 1,\n",
              " 'retail unit': 1,\n",
              " 'insider': 1,\n",
              " 'zaghari-ratcliffe': 1,\n",
              " 'hunt': 1,\n",
              " 'iran': 1,\n",
              " 'debt payment': 1}"
            ]
          },
          "execution_count": 22,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "w = {}\n",
        "for x in all_titles:\n",
        "    for n in TextBlob(x).noun_phrases:\n",
        "        if n in w:\n",
        "            w[n].append(x)\n",
        "        else:\n",
        "            w[n]=[x]\n",
        "{ x:len(w[x]) for x in w.keys()}"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We can see that nouns do not give us large thematic groups. Let's substitute nouns by more general terms obtained from the concept graph. This will take some time, because we are doing REST call for each noun phrase."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 23,
      "metadata": {
        "trusted": true
      },
      "outputs": [],
      "source": [
        "w = {}\n",
        "for x in all_titles:\n",
        "    for noun in TextBlob(x).noun_phrases:\n",
        "        terms = query(noun.replace(' ','%20'))\n",
        "        for term in [u for u in terms.keys() if terms[u]>0.1]:\n",
        "            if term in w:\n",
        "                w[term].append(x)\n",
        "            else:\n",
        "                w[term]=[x]"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 24,
      "metadata": {
        "trusted": true
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "{'city': 9,\n",
              " 'brand': 4,\n",
              " 'place': 9,\n",
              " 'town': 4,\n",
              " 'factor': 4,\n",
              " 'film': 4,\n",
              " 'nation': 11,\n",
              " 'state': 5,\n",
              " 'person': 4,\n",
              " 'organization': 5,\n",
              " 'publication': 10,\n",
              " 'market': 5,\n",
              " 'economy': 4,\n",
              " 'company': 6,\n",
              " 'newspaper': 6,\n",
              " 'relationship': 6}"
            ]
          },
          "execution_count": 24,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "{ x:len(w[x]) for x in w.keys() if len(w[x])>3}"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 27,
      "metadata": {
        "trusted": true
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "\n",
            "ECONOMY:\n",
            "China searches for victims, flight recorders after first plane crash in 12 years - Reuters\n",
            "Live updates: Russia stops talks with Japan over sanctions - The Associated Press - en Español\n",
            "China plane crash – live: Search for survivors continues as witness describes moment flight fell from sky - The Independent\n",
            "UK prepares to nationalize Russia natural gas giant Gazprom's retail unit - Business Insider\n",
            "\n",
            "NATION:\n",
            "‘Clear sign’ Putin considering using chemical weapons in Ukraine, claims President Biden - The Independent\n",
            "Duchess of Cambridge swaps khaki jungle gear for Vampire's Wife dress on Belize trip - Daily Mail\n",
            "China searches for victims, flight recorders after first plane crash in 12 years - Reuters\n",
            "Live updates: Russia stops talks with Japan over sanctions - The Associated Press - en Español\n",
            "Live updates: Russia stops talks with Japan over sanctions - The Associated Press - en Español\n",
            "China plane crash – live: Search for survivors continues as witness describes moment flight fell from sky - The Independent\n",
            "UK and Republic of Ireland in line to host Euro 2028 after no one else bids - The Guardian\n",
            "Friends beg Vladimir Putin's 'lover' to persuade him to end Ukraine invasion - The Mirror\n",
            "‘Clear sign’ Putin considering using chemical weapons in Ukraine, claims President Biden - The Independent\n",
            "UK prepares to nationalize Russia natural gas giant Gazprom's retail unit - Business Insider\n",
            "Zaghari-Ratcliffe: Hunt calls for inquiry into delay over Iran debt payment - The Guardian\n",
            "\n",
            "PERSON:\n",
            "‘Clear sign’ Putin considering using chemical weapons in Ukraine, claims President Biden - The Independent\n",
            "Duchess of Cambridge swaps khaki jungle gear for Vampire's Wife dress on Belize trip - Daily Mail\n",
            "Second superyacht linked to Russian oligarch Abramovich docks in Turkey - Reuters\n",
            "‘Clear sign’ Putin considering using chemical weapons in Ukraine, claims President Biden - The Independent\n"
          ]
        }
      ],
      "source": [
        "print('\\nECONOMY:\\n'+'\\n'.join(w['economy']))\n",
        "print('\\nNATION:\\n'+'\\n'.join(w['nation']))\n",
        "print('\\nPERSON:\\n'+'\\n'.join(w['person']))"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3.7.4 64-bit (conda)",
      "metadata": {
        "interpreter": {
          "hash": "86193a1ab0ba47eac1c69c1756090baa3b420b3eea7d4aafab8b85f8b312f0c5"
        }
      },
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.9.5"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 2
}
